BMC Genomics — Latest Matching Preprints

1

An optimized ribodepletion approach for C. elegans RNA-sequencing libraries

Barrett, A.; McWhirter, R.; Taylor, S. R.; Weinreb, A.; Miller, D.; Hammarlund, M.

2021-01-05 genomics 10.1101/2021.01.04.425342 medRxiv

Top 0.1%

52.8%

A recent and powerful technique is to obtain transcriptomes from rare cell populations, such as single neurons in C. elegans, by enriching dissociated cells using fluorescent sorting. However, these cell samples often have low yields of RNA that present challenges in library preparation. This can lead to PCR duplicates, noisy gene expression for lowly expressed genes, and other issues that limit endpoint analysis. Further, some common resources, such as sequence specific kits for removing ribosomal RNA, are not optimized for non-mammalian samples. To optimize library construction for such challenging samples, we compared two approaches for building RNAseq libraries from less than 10 nanograms of C. elegans RNA: SMARTSeq V4 (Takara), a widely used kit for selecting poly-adenylated transcripts; and SoLo Ovation (Tecan Genomics), a newly developed ribodepletion-based approach. For ribodepletion, we used a custom kit of 200 probes designed to match C. elegans rRNA gene sequences. We found that SoLo Ovation, in combination with our custom C. elegans probe set for rRNA depletion, detects an expanded set of noncoding RNAs, shows reduced noise in lowly expressed genes, and more accurately counts expression of long genes. The approach described here should be broadly useful for similar efforts to analyze transcriptomics when RNA is limiting.

2

Basal Contamination of Sequencing: Lessons from the GTEx dataset

Nieuwenhuis, T. O.; Yang, S.; Verma, R. X.; Pillalamarri, V.; Arking, D.; Rosenberg, A. Z.; McCall, M. N.; Halushka, M. K.

2020-01-02 genomics 10.1101/602367 medRxiv

Top 0.1%

37.3%

One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata and validating RNA-Seq from other studies. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p=9.5e-237 and p=5e-260 respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as ≥10 PRSS1 TPM). It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.

3

Performance of a scalable extraction-free RNA-seq method

Ghimire, S.; Stewart, C. G.; Thurman, A. L.; Pezzulo, A. A.

2021-01-22 genomics 10.1101/2021.01.22.427817 medRxiv

Top 0.1%

33.6%

RNA sequencing enables high-contents/high-complexity measurements in small molecule screens performed on biological samples. Whereas the costs of DNA sequencing and the complexity of RNA-seq library preparation and analysis have decreased consistently, RNA extraction remains a significant bottleneck for RNA-seq of hundreds of samples in parallel. Direct use of cell lysate for RNA-seq library prep is common in single cell RNA-seq but not in bulk RNA-seq protocols. Recently published protocols suggest that direct lysis is compatible with simplified RNA-seq library prep. Here, we evaluate the performance of a bulk RNA-seq library prep protocol optimized for analysis of many samples of adherent cultured cells in parallel. We combine a low-cost direct lysis buffer compatible with cDNA synthesis ("in-lysate cDNA synthesis") with Smart-3SEQ and examine the effects of calmidazolium and fludrocortisone-induced perturbation of primary human dermal fibroblasts. We compared this method to normalized purified RNA inputs from matching samples followed by Smart-3SEQ or Illumina TruSeq library prep. Our results show that whereas variable RNA inputs for each sample in the in-lysate cDNA synthesis protocol result in variable sequencing depth, this had minimal effect on data quality, measurement of gene expression patterns, or generation of differentially expressed gene lists. We found that in-lysate cDNA synthesis combined with Smart-3SEQ RNA-seq library prep allows generation of high-quality data when compared to library prep with extracted RNA, or when compared to Illumina TruSeq. Our data show that small molecule screens using RNA-seq are feasible at low reagent and time costs.

4

Significantly Improved Mouse and Rat Genome Annotation Using Sequence Read Archive RNA-seq Data

Meng, F.; Turner, D. L.; Hagenauer, M. H.; Watson, S.; Akil, H.

2026-03-09 genomics 10.64898/2026.03.06.709975 medRxiv

Top 0.1%

33.4%

To detect currently unannotated genes with low expression levels with high sensitivity and accuracy, we developed a new exon->gene->transcript annotation pipeline that can identify previously undetected multi-exon transcripts using large volumes of RNA-Seq data. Our pipeline incorporates three new algorithms: 1) model-based spliced exon detection, 2) exon-to-gene assignment across multiple tissue/datasets through exon community discovery, and 3) ranking top transcripts by a stepwise minimum flow procedure. The design of our pipeline allowed us to leverage hundreds of Tbases of public RNA-seq data as input to improve mouse and rat genome annotation. Using this data, our pipeline identified close to 15K and 21K unannotated genes in GENCODE M37 and ENSEMBL 114 for mouse and rat, respectively. Each species also gained over 200K predicted transcripts containing at least one new exon, although most were transcripts from GENCODE/ENSEMBL annotated genes with newly assigned exons. To make our genome annotation available for common use, we have packaged this new annotation in standard file formats for the analysis of bulk and single cell RNA-seq data (GTF, 10X genome files). We have also provided two use examples which demonstrate the utility of our newly annotated genes in functional analyses, showing that their expression can be differentially regulated in relationship to cell type and selective breeding. Due to the efficiency provided by our pipeline, we expect that as new RNA-seq data become available in the coming years it will significantly benefit rat gene/transcript annotation, eventually enabling us to approach the target of complete gene and transcript annotation.

5

A novel reusable transcriptome-wide association study workflow used to map key genes linked to important cattle traits

Jayaraman, S.; Chitneedi, P. K.; Kadri, N. K.; Costa-Monteiro-Moreira, G.; Salavati, M.; Charlier, C.; Boichard, D.; Sanchez, M.-P.; Pausch, H.; Kuehn, C.; Prendergast, J. G.; Clark, E. L.

2025-06-12 genomics 10.1101/2025.06.10.658680 medRxiv

Top 0.1%

33.0%

Transcriptome-wide association studies (TWAS) are a powerful approach for studying the genes underlying complex traits by directly integrating GWAS and gene expression datasets. In cattle, they have been previously applied to identify genes driving fertility, milk production, and health. However, these studies have also highlighted several challenges, from difficulties in reproducing these complex analyses to limitations from poor genotype calls, especially when called directly from RNA sequencing data. To address these and other challenges, for the H2020 BovReg Project, we have developed a streamlined, species-agnostic, and reusable Nextflow TWAS workflow to integrate transcriptomic and GWAS summary statistic datasets. Our workflow first generates accurate genotype calls and gene expression prediction models from transcriptomic datasets and then applies these tools to impute gene expression levels into GWAS cohorts, enabling the association of genes with traits of interest. We explore optimal strategies for calling genetic variants directly from transcriptomic data and illustrate that using imputation approaches specifically designed for low-pass sequencing data can improve variant calling over previously adopted methods. We demonstrate the utility of our TWAS workflow by applying it to both novel and publicly available GWAS cohorts for cattle, detecting novel gene-trait associations for complex traits. Using a new transcriptome annotation of the cattle genome generated for the BovReg project, we also illustrate how previously un-assayable associations can be detected. The results and the workflow we present provide a new resource for the community and contribute to a better understanding of the molecular drivers of complex traits in cattle, with the goal of eventually leveraging this information in future breeding decisions.

6

Critical Differential Expression Assessment for Individual Bulk RNA-Seq Projects

Warden, C. D.; Wu, X.

2024-02-12 genomics 10.1101/2024.02.10.579728 medRxiv

Top 0.1%

32.3%

Finding the right balance of quality and quantity can be important, and it is essential that project quality does not drop below the level where important main conclusions are missed or misstated. We use knock-out and over-expression studies as a simplification to test recovery of a known causal gene in RNA-Seq cell line experiments. When single-end RNA-Seq reads are aligned with STAR and quantified with htseq-count, we found potential value in testing the use of the Generalized Linear Model (GLM) implementation of edgeR with robust dispersion estimation more frequently for either single-variate or multi-variate 2-group comparisons (with the possibility of defining criteria less stringent than |fold-change| > 1.5 and FDR < 0.05). When considering a limited number of patient sample comparisons with larger sample size, there might be some decreased variability between methods (except for DESeq1). However, at the same time, the ranking of the gene identified using immunohistochemistry (for ER/PR/HER2 in breast cancer samples from The Cancer Genome Atlas) showed a possible shift in performance compared to the cell line comparisons, potentially highlighting utility for standard statistical tests and/or limma-based analysis with larger sample sizes. If this continues to be true in additional studies and comparisons, then that could be consistent with the possibility that it may be important to allocate time for potential methods troubleshooting for genomics projects. Analysis of public data presented in this study does not consider all experimental designs, and presentation of downstream analysis is limited. So, any estimate from this simplification would be an underestimation of the true need for some methods testing for every project. Additionally, this set of independent cell line experiments has a limitation in being able to determine the frequency of missing a highly important gene if the problem is rare (such as 10% or lower). For example, if there was an assumption that only one method can be tested for "initial" analysis, then it is not completely clear to what extent using edgeR-robust might perform better than DESeq2 in the cell line experiments. Importantly, we do not wish to cause undue concern, and we believe that it should often be possible to define a gene expression differential expression workflow that is suitable for some purposes for many samples. Nevertheless, at the same time, we provide a variety of measures that we believe emphasize the need to critically assess every individual project and maximize confidence in published results.

7

Mapping splice QTLs reveals distinct transcriptional and post-transcriptional regulatory variation of gene expression in pigs

Zhang, F.; Velez-Irizarry, D.; Ernst, C. W.; Huang, W.

2022-11-22 genomics 10.1101/2022.11.20.517281 medRxiv

Top 0.1%

28.3%

Background: Alternative splicing is an important step in gene expression, generating multiple isoforms for the same genes and greatly expanding the diversity of proteomes. Genetic variation in alternative splicing contributes to phenotypic diversity in natural populations. However, the genetic basis of variation in alternative splicing in livestock animals including pigs remains poorly understood. Results: In this study, using a Duroc x Pietrain F2 pig population, we performed genome-wide analysis of alternative splicing estimated from stranded RNA-Seq data in skeletal muscle. We characterized the genetic architecture of alternative splicing and compared its basic features with overall gene expression. We detected a large number of novel alternative splicing events that were not previously annotated. We found heritability of quantitative alternative splicing scores (percent spliced in or PSI) to be lower than that of overall gene expression. In addition, heritabilities showed little correlation between alternative splicing and overall gene expression. Finally, we mapped expression QTLs (eQTLs) and splice QTLs (sQTLs) and found them to be largely non-overlapping. Conclusions: Our results suggest that regulatory variation exists at multiple levels and that their genetic controls are distinct, offering opportunities for genetic improvement.

8

Genome assembly variation and its implications for gene discovery in nematode species

Mariene, G. M.; Wasmuth, J. D.

2024-02-29 genomics 10.1101/2024.02.26.582167 medRxiv

Top 0.1%

27.7%

Genome assemblers are a critical component of genome science, but the choice of assembly software and protocols can be daunting. Here, we investigate genome assembly variation and its implications for gene discovery across three nematode species--Caenorhabditis bovis, Haemonchus contortus, and Heligmosomoides bakeri--highlighting the critical interplay between assembly choice and downstream genomic analysis. Selecting popular genome assemblers, we generated multiple assemblies for each species, analyzing their structure, completeness, and effect on gene family analysis. Our findings demonstrate that assembly variations can significantly affect gene family composition, with notable differences in critical gene families like cyp, gst, ugt, and nhr. Despite broadly similar performance using various assembly metrics, comparisons of assemblies with a single species revealed underlying structural rearrangements and inconsistencies in gene content. This emphasizes the imperative for continuous refinement of genomic resources. Our findings advocate for a cautious and informed approach to genome assembly and annotation to ensure reliable and insightful genomic interpretations.

9

Unraveling the phylogenetic signal of gene expression from single-cell RNA-seq data

Alves, J. M.; Tomas, L.; Posada, D.

2024-04-20 genomics 10.1101/2024.04.17.589871 medRxiv

Top 0.1%

26.1%

Single-cell RNA sequencing (scRNA-seq) has transformed our understanding of phenotypic heterogeneity. Although the predominant focus of scRNA-seq analyses has been assessing gene expression changes, several approaches have been proposed in recent years to identify changes at the DNA level from scRNA-seq data. In this study, we evaluated the relative performance of six strategies for calling single-nucleotide variants from scRNA-seq data using 381 single-cell transcriptomes from five cancer patients. Specifically, we focused on the quality of the inferred genotypes and the resulting single-cell phylogenies. We found that scAllele, Monopogen, and Monovar consistently returned phylogenetically informative genotype calls, providing more precise signals of discrimination between tumor and normal cells within heterogeneous samples and among distinct subclonal lineages in longitudinal samples. In addition, we evaluated the evolution of gene expression along the cell phylogenies. While most transcriptomic variation was very plastic and did not correlate with the cell phylogeny, a group of genes associated with cell cycle processes showed a strong phylogenetic signal in one of the patients, underscoring a potential link between gene expression patterns and lineage-specific traits in the context of cancer progression. In summary, our study highlights the potential of scRNA-seq data for inferring cell phylogenies to decipher the evolutionary dynamics of cell populations.

10

The complete genome of the KOLF2.1J reference iPSC line

Alvarez Jerez, P.; Rhie, A.; Kim, J.; Hebbar, P.; Nag, S.; Antipov, D.; Koren, S.; Lara, E.; Beilina, A.; Hansen, N. F.; Arber, C. F.; Zulueta, J.; Wild-Crea, P.; Patel, D.; Hickey, G.; Waltz, B.; Malik, L.; Skarnes, W. C.; Reed, X.; Genner, R.; Daida, K.; Pantazis, C. B.; Grenn, F.; Nalls, M. A.; Billingsley, K.; Fossati, V.; Wray, S.; Ward, M.; Ryten, M.; Cookson, M. R.; Jain, M.; Paten, B.; Phillippy, A. M.; Blauwendraat, C.

2026-03-09 genomics 10.64898/2026.03.06.710144 medRxiv

Top 0.1%

23.5%

While induced pluripotent stem cells (iPSCs) have gained popularity in studying neurodegenerative diseases, the heterogeneity of stem cells used across studies impacts cross-study comparison. The iPSC Neurodegenerative Disease Initiative (iNDI) selected the KOLF2.1J cell line and prioritized its use as a reference standard for studying the effects of pathogenic variants on cell biology due to its stability and neutral neurodegenerative disease genetic risk. This cell line, and its derivatives expressing over 100 variants related to Alzheimer's disease, Parkinson's disease, and other neurological diseases, are available for academic and industry access. Current genomic data analyses are limited by the use of a human reference genome that does not capture the complete genetic background of a given iPSC line. While in the future this issue may be partially mitigated by the creation of a comprehensive human pangenome, previous work has shown that generating custom genomes is of value both to characterize the variation present and to serve as a more appropriate genomic reference. Here, we generated and characterized a custom complete genome assembly from KOLF2.1J. Mapping of sequencing reads to a personalized diploid assembly results in more comprehensive mapping compared to traditional linear references (i.e., GRCh38). In addition, we provide a comprehensive custom gene annotation along with isoform expression and differential methylation analyses across multiple cell types. The assembly and all additional data are browsable and publicly available. This resource will enable more accurate investigation of the KOLF2.1J cell line and any genomics data generated compared to using traditional generalized references, while also serving as a foundational approach for establishing custom reference assemblies for other high-value iPSC lines.

11

High throughput single-cell RNA sequencing of intact adult cardiomyocytes and non-myocytes using a split-pool approach

Hu, Y.; Gurung, R.; Mueller, S.; Villanueva, E.; Stenzig, J.; Rayan, N.; Luu, T. D. A.; Nur, S.; Tan, B.; Liu, B.; Yu, H.; Choi, H.; Foo, R.; Ackers-Johnson, M. A.

2026-04-30 cell biology 10.64898/2026.04.28.721288 medRxiv

Top 0.1%

23.5%

MOTIVATION: Adult cardiomyocytes are difficult to profile by whole-cell single-cell RNA sequencing because of their large size and fragility, which make them poorly compatible with standard workflows. Current approaches for adult cardiomyocyte transcriptomics often require a trade-off between data quality and throughput; thus, studies instead rely heavily on sequencing of nuclei alone. Therefore, we set out to develop a high-quality and scalable workflow for adult heart cells using in-cell ligation and split-pool barcoding strategies to address this methodological gap. This workflow may be further generalisable to other large cell types or samples containing cell populations with highly unequal RNA content. SUMMARY: Adult cardiomyocytes are difficult to profile by whole-cell single-cell RNA sequencing (scRNA-seq). Here, we developed a high-quality and scalable workflow for adult heart cells using in-cell ligation and split-pool barcoding. We identified per-cell RNA content as a significant variable that must be accounted for. Separation of cardiomyocytes (large cells) and non-cardiomyocytes (small cells) before library construction, and allocation of deeper sequencing to cardiomyocytes, produced high-quality whole-cell datasets for both compartments. Compared with single-nucleus RNA sequencing, whole-cell cardiomyocyte profiling better recovered metabolic, mitochondrial, cytoplasmic translational, and contractile gene programs. This workflow provides a practical method for scalable, high-quality cardiomyocyte whole-cell scRNA-seq and offers general strategies for other large cell types or samples containing cell populations with highly unequal RNA content.

12

Assembly of a pangenome uncovers novel non-reference unique insertion sequences in cattle highlighting their genetic diversity

Sorin, V.; Besnard, F.; Capitan, A.; Grohs, C.; Naji, M. M.; Escouflaire, C.; Fritz, S.; Lledo, J.; Eche, C.; Iampietro, C.; Donnadieu, C.; Milan, D.; Drouilhet, L.; Tosser-Klopp, G.; Boichard, D.; Klopp, C.; Sanchez, M.-P.; Boussaha, M.

2025-12-04 genetics 10.64898/2025.12.02.691810 medRxiv

Top 0.1%

23.4%

Background: The current cattle reference genome, derived from a single Hereford cow, does not capture the full spectrum of genetic diversity present within the species. Moreover, detecting structural variations (SVs ≥ 50 nucleotides long) remains challenging using only standard approaches of either short or long-read sequence approaches against a linear reference genome. Recent advances in long-read sequencing technologies and graph-based assembly now enable the construction of breed-specific pangenomes, revealing previously uncharacterized genomic regions that may contribute to important agricultural traits. Results: In this study we constructed a cattle pangenome graph using 16 high-quality haplotype-resolved genome assemblies originating from nine breeds representing the diversity of French cattle populations, and including Yak (Bos grunniens) as a close outgroup species. Using a trio-based strategy combined with complementary sequencing technologies and bioinformatics methods, we identified and characterized 101,219 structural variations. Of these, 33,634 were classified as non-reference unique insertions (NRUIs), adding several megabases of novel genomic sequences absent from the current Hereford reference genome. Analysis of the distribution of these NRUIs revealed significant genome-wide enrichment within QTL regions associated with milk production and morphological traits, suggesting their contribution to the genetic basis of economically relevant phenotypes. Furthermore, their functional annotation highlighted two NRUIs located within the intronic regions of ARMH3 and EPHA5, both specific to the Normande breed and significantly associated with milk production and morphological traits, respectively. Conclusions: Our findings demonstrate the value of pangenome approaches to uncover functionally relevant SVs, particularly NRUIs, that are systematically absent from the current reference genome. By linking these variants to economically important traits, our work underscores the need to incorporate breed diversity into future genomic analyses and reference-building efforts in cattle.

13

Customizable host and viral transcript enrichment using CRISPR-Cas9 long-read sequencing for isoform discovery and validation

Nguyen, A. N. T.; Zhang, J.; Zhang, S.; Pitt, M. E.; Ganesamoorthy, D.; Fritzlar, S.; Chang, J. J. Y.; Londrigan, S.; Coin, L. J. M.

2025-04-14 genomics 10.1101/2025.04.11.648353 medRxiv

Top 0.1%

23.3%

Long-read RNA sequencing has been broadly utilized to examine the diversity of transcriptomes, understand differential expression and discover novel transcript isoforms. One of the major limitations of whole transcriptome sequencing is the difficulty in obtaining sufficient depth for low abundant transcripts. Methods which address this are either difficult to scale or customize: long-range PCR is customizable but difficult to scale beyond a few targets; probe hybridization panels are suited for scaling but require substantial investment to customize. In this study, we adopted RNA-guided CRISPR-Cas9 nuclease-based enrichment to target specific human and SARS-CoV-2 transcripts followed by long-read sequencing, utilizing a minimal number of guide RNAs per target isoform. Our findings demonstrate that the CRISPR-Cas system is a highly effective method for customizable long-read sequencing of target transcripts while maintaining the accuracy of relative gene expression levels. The results highlight a valuable method for future research on transcript enrichment for isoform identification and low abundance transcript detection in infectious disease diagnosis.

14

k-mer spectra and allelic coverage analyses reveal pervasive polymorphic duplications in Ostrea edulis

Colston-Nepali, L.; Bierne, N.; Lapegue, S.

2025-07-16 genomics 10.1101/2025.06.24.661118 medRxiv

Top 0.1%

23.0%

Marine bivalves are known for their high genetic diversity and potentially high genetic load, as well as their structurally complex genomes. The European flat oyster, Ostrea edulis, is no exception. In this study we performed both k-mer based and reference-based analyses on short-read data with high coverage (~70-160X) from five individuals. Up to one-third of heterozygous SNPs showed allelic coverage fractions that deviated from expectation. Despite no evidence of recent whole-genome duplication, we detected significant signals for genomic regions of increased ploidy, and 15% and 25% of k-mer pairs distant by a single nucleotide displayed triploid-like and tetraploid-like profiles respectively. In a diploid species, this indicates the presence of heterozygote and homozygote duplications. These results were confirmed by coverage-based genotype inference, which showed that deviant SNPs are located in polymorphic duplications. While duplications affected genic and non-genic regions similarly, their genomic distribution varied across chromosomes, with a notable enrichment in non-metacentric chromosomes. The latter being prone to somatic aneuploidy, we suggest that aneuploidy could act as an indirect driver, with duplications potentially buffering the expression of recessive deleterious mutations in aneuploid cells. Finally, we consider how such widespread structural variation may complicate population genomic analyses in O. edulis and other marine bivalves.
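As a rough illustration of what an allelic coverage fraction "deviating from expectation" can look like, the sketch below assigns a heterozygous SNP to the copy-number profile whose idealized allele ratio is closest to the observed one. It is not the authors' method: the nearest-ratio rule, the `EXPECTED` table and the function name `classify_snp` are all assumptions made for this example, and real data would need a statistical test that accounts for sampling noise.

```python
# Idealized alt-allele coverage fractions for each profile (assumption:
# 1:1 for diploid, 1:2 or 2:1 for triploid-like, 1:3 or 3:1 for tetraploid-like).
# A 2:2 tetraploid-like site has fraction 0.5 and cannot be told apart from
# a diploid heterozygote by this ratio alone.
EXPECTED = {
    "diploid-like": [1 / 2],
    "triploid-like": [1 / 3, 2 / 3],
    "tetraploid-like": [1 / 4, 3 / 4],
}


def classify_snp(alt_depth, total_depth):
    """Return the profile whose expected fraction is nearest to alt/total."""
    if total_depth <= 0:
        raise ValueError("total_depth must be positive")
    fraction = alt_depth / total_depth
    # Collect (distance, label) pairs and keep the smallest distance.
    candidates = [
        (abs(fraction - expected), label)
        for label, fractions in EXPECTED.items()
        for expected in fractions
    ]
    return min(candidates)[1]


if __name__ == "__main__":
    # Hypothetical depths at roughly 70X coverage.
    for alt, total in [(35, 70), (23, 70), (18, 70)]:
        print(alt, total, classify_snp(alt, total))
```

Under this toy rule, a SNP with 23 of 70 reads supporting the alternate allele sits closest to the 1:2 ratio, so it would be flagged as a candidate for lying in a duplicated region.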

15

Isoform-Level Analysis of 10x Genomics Single-Cell cDNA Libraries from Cultured K562 Cells Using Long-Read Sequencing

Bachmann, J.; Olsen, R.-A.; Bhatt, A.; Crang, N.; Contreras-Lopez, O.; Mezger, A.

2025-08-15 genomics 10.1101/2025.08.12.668929 medRxiv

Top 0.1%

23.0%

Integration of Oxford Nanopore Technologies (ONT) long-read sequencing with 10x Genomics single-cell cDNA libraries enables novel transcript detection, isoform analysis and captures full-length gene body coverage. The purpose of the study was the comparison of three approaches for sequencing 10x Genomics Chromium Single Cell cDNA libraries using long-read sequencing: single-cell full-length transcript sequencing by sampling (FLT-seq), the cDNA-PCR Sequencing Kit (SQK-PCS111) and the PCR Expansion Kit (EXP-PCA001). Our aim was to evaluate their efficiency in enriching full-length cDNA fragments, identifying barcodes, detecting novel isoforms and mutations, and characterizing transcript coverage profiles.

16

Estimating and correcting index hopping misassignments in single-cell RNA-seq data

Miao, L.; Collado, L.; Barkdull, S.; Saito, Y.; Jo, J.-H.; Han, J.; Dell'Orso, S.; Kelly, M. C.; Conlan, S.; Kong, H. H.; Brownell, I.

2024-10-24 genomics 10.1101/2024.10.21.619353 medRxiv

Top 0.1%

22.9%

Background: Index hopping causes read assignment errors in data from multiplexed sequencing libraries. This issue has become more prevalent with the widespread use of high-capacity sequencers and highly multiplexed single-cell RNA sequencing (scRNA-seq). Results: We conducted deep, plate-based scRNA-seq on a mixed population of mouse skin cells. Analysis of transcriptomes from 1152 cells identified four distinct cell types. To estimate the error rate in sample assignment due to index hopping, we employed differential expression analysis to identify signature genes that were highly and specifically expressed in each cell type. We quantified the proportion of misassigned reads by examining the detection rates of signature genes in other cell types. Remarkably, regardless of gene expression levels, we estimated that 0.65% of reads per gene were assigned to an incorrect cell across our data. To computationally compensate for index hopping, we developed a simple correction method wherein, for each gene, 0.65% of the library's average expression level was subtracted from the expression in each cell. This correction had notable effects on transcriptome analyses, including increased cell-cell clustering distance and alterations in intermediate state assignments of cell differentiation. Conclusions: Index hopping misassignments are measurable and can impact the experimental interpretation of sequencing results. We devised a straightforward method to estimate and correct for the index hopping rate by quantifying misassigned genes in distinct cell types within an scRNA-seq library. This approach can be applied to any barcoded, multiplexed scRNA-seq library containing cells with distinct expression profiles, allowing for correction of the expression matrix before conducting biological analysis.
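The correction described in this abstract (subtracting, for each gene, 0.65% of the library-wide average expression from every cell) can be sketched in a few lines. This is a minimal reading of the abstract's description, not the authors' code: the function name `correct_index_hopping`, the dict-of-lists input layout, and the clipping of negative values to zero are assumptions introduced here.

```python
def correct_index_hopping(expression, hop_rate=0.0065):
    """Subtract hop_rate * (mean expression across cells) from each cell, per gene.

    expression: dict mapping gene name -> list of per-cell expression values
                for one multiplexed library (hypothetical layout).
    hop_rate:   estimated fraction of misassigned reads per gene
                (0.65% in the abstract above).
    """
    corrected = {}
    for gene, values in expression.items():
        if not values:
            corrected[gene] = []
            continue
        library_mean = sum(values) / len(values)
        offset = hop_rate * library_mean
        # Assumption: values that would drop below zero are clipped to zero.
        corrected[gene] = [max(value - offset, 0.0) for value in values]
    return corrected


if __name__ == "__main__":
    # Toy library of four cells; "geneA" is a made-up signature gene
    # truly expressed only in the first cell.
    toy = {"geneA": [1000.0, 2.0, 0.0, 1.0]}
    print(correct_index_hopping(toy))
```

In the toy input the library mean for `geneA` is 250.75, so about 1.63 is subtracted from each cell: the low counts in the non-expressing cells mostly fall to zero, while the expressing cell is barely affected.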

17

Deciphering Bacterial and Archaeal Transcriptional Dark Matter and Its Architectural Complexity

Mattick, J. S. A.; Bromley, R. E.; Watson, K. J.; Adkins, R. S.; Holt, C. I.; Lebov, J. F.; Sparklin, B. C.; Tyson, T. S.; Rasko, D. A.; Hotopp, J. C. D.

2024-04-03 genomics 10.1101/2024.04.02.587803 medRxiv

Top 0.1%

22.7%

Transcripts are potential therapeutic targets, yet bacterial transcripts remain biological dark matter with uncharacterized biodiversity. We developed and applied an algorithm to predict transcripts for Escherichia coli K12 and E2348/69 strains (Bacteria:gamma-Proteobacteria) with newly generated ONT direct RNA sequencing data while predicting transcripts for Listeria monocytogenes strains Scott A and RO15 (Bacteria:Firmicute), Pseudomonas aeruginosa strains SG17M and NN2 (Bacteria:gamma-Proteobacteria), and Haloferax volcanii (Archaea:Halobacteria) using publicly available data. From >5 million E. coli K12 ONT direct RNA sequencing reads, 2,484 mRNAs are predicted and contain more than half of the predicted E. coli proteins. While the number of predicted transcripts varied by strain based on the amount of sequence data used for the predictions, across all strains examined, the average size of the predicted mRNAs is 1.6-1.7 kbp while the median size of the predicted bacterial 5'- and 3'-UTRs is 30-90 bp. Given the lack of bacterial and archaeal transcript annotation, most predictions are of novel transcripts, but we also predicted many previously characterized mRNAs and ncRNAs, including post-transcriptionally generated transcripts and small RNAs associated with pathogenesis in the E. coli E2348/69 LEE pathogenicity islands. We predicted small transcripts in the 100-200 bp range as well as >10 kbp transcripts for all strains, with the longest transcript for two of the seven strains being the nuo operon transcript, and for another two strains it was a phage/prophage transcript. This quick, easy, inexpensive, and reproducible method will facilitate the presentation of operons, transcripts, and UTR predictions alongside CDS and protein predictions in bacterial genome annotation as important resources for the research community.

18

Comparative Analysis of Single-Nucleus and Single-Cell RNA Sequencing in Human Bone Marrow Mononuclear Cells: Methodological Insights and Trade-offs

Ghamsari, R.; de Graaf, C. A.; Thijssen, R.; You, Y.; Lovell, N. H.; Alinejad-Rokny, H.; Ritchie, M. E.

2025-09-09 bioinformatics 10.1101/2025.09.08.675012 medRxiv

Top 0.1%

22.7%

Bone marrow mononuclear cells (BMMCs) are a heterogeneous pool of hematopoietic progenitors and mature immune cells that collectively sustain hematopoiesis and coordinate immune responses. The bone marrow serves not only as the primary site for blood cell production but also as a niche for various disorders, including blood cancers. Advances in single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) have significantly enhanced our understanding of the cellular biology and molecular dynamics within this complex microenvironment. The choice between these two approaches, however, is often shaped or constrained by the study design, such as research objectives, sample type, and preservation conditions. Consequently, methodological differences in library preparation and transcript capture efficiency can introduce systematic biases that complicate downstream analyses and interpretation, underscoring the need to identify and account for method-specific features. In this study, we conducted a comparative analysis of matched snRNA-seq and scRNA-seq datasets from 11 pairs of healthy donor bone marrow mononuclear cell samples, generated using the popular 10x Genomics platform. We evaluated method-specific biases using multiple quality metrics and compared cell type proportions and transcriptomic signatures captured by each approach. Integrative analysis of these datasets is feasible but not advisable due to systematic gene length biases that were observed between these approaches. Our results showed that despite inherent differences in library complexity, both protocols reliably captured all major cell types. This comparative analysis highlights intrinsic differences between snRNA-seq and scRNA-seq data, providing valuable insights into their respective advantages, limitations, and trade-offs. 
These findings can assist researchers in selecting the optimal method tailored to specific biological questions and sample characteristics, and also enable more method-aware data analysis and interpretation.

19

Improved target capture with lower hybridization temperatures for invertebrate loci with different baiting strategies: a case study of the leaf-footed bugs and allies (Hemiptera: Coreoidea)

Forthman, M.; Gordon, E. R. L.; Kimball, R. T.

2022-03-04 genomics 10.1101/2022.03.02.482542 medRxiv

Top 0.1%

22.6%

Target capture approaches are widely used in phylogenomic studies, yet only four experimental comparisons of a critical parameter, hybridization temperature, have been published. These studies provide conflicting conclusions regarding the benefits of lower temperatures during target capture, and none include invertebrates where bait-target divergences may be higher than seen in vertebrate capture studies. Most capture studies use a fixed hybridization temperature of 65°C to maximize the proportion of on-target data, but many invertebrate capture studies report low locus recovery. Lower hybridization temperatures, which might improve locus recovery, are not commonly employed in invertebrate capture studies. We used leaf-footed bugs and relatives (Hemiptera: Coreoidea) to investigate the effect of hybridization temperature on capture success of ultraconserved elements (UCE) targeted by previously published baits derived from divergent hemipteran genomes and other loci targeted by newly designed baits derived from less divergent coreoid transcriptomes. We found touchdown capture approaches with lower hybridization temperatures generally resulted in lower proportions of on-target reads and lower read depth but were associated with more contigs and improved recovery of UCE loci. Low temperatures were also associated with increased numbers of putative paralogs of UCE loci. Hybridization temperatures did not generally affect recovery of newly targeted loci, which we attributed to their lower bait-target divergences (compared to higher divergences between UCE baits and targets) and greater bait tiling density. Thus, optimizing in vitro target capture conditions to accommodate low hybridization temperatures can provide a cost-effective, widely applicable solution to improve recovery of protein-coding loci in invertebrates.

20

Detection and evaluation of copy number variation using both linked-read and short-read sequencing in New Zealand dairy cattle

Wang, Y.; Nugroho, T.; Johnson, T. J. J.; Couldrey, C.; Harris, B. L.

2026-04-23 bioinformatics 10.64898/2026.04.20.718595 medRxiv

Top 0.1%

22.5%

In recent years, genetic studies have made significant progress in identifying single-nucleotide polymorphisms (SNPs) associated with cattle health and production traits. However, it is still challenging to identify and validate more complicated forms of variation, such as copy number variation (CNV) and other types of structural variation (SV). In this study, SV regions were identified using 37 New Zealand dairy cattle with linked-read sequence data. A transmission-based framework was used to validate these variants at the population scale. A total of 62,438 putative autosomal SV regions were identified with the LongRanger pipeline following the 10x Genomics recommendations. Copy number states for these regions were subsequently estimated via a read-depth based genotyping method using CNVpytor in a population-representative cohort of 2,306 animals using Illumina short-read sequencing technology. Mendelian inheritance of copy number states was assessed using linear mixed models incorporating pedigree information, and transmission levels were used to quantify the biological validity of each CNV region. Transmission levels ranged widely, with a mean of 0.5162 across all regions, where higher transmission levels were proportionally enriched for larger SVs. A total of 7,218 CNV regions exhibited high transmission levels (>0.9), indicating strong evidence of inheritance. Among these, 7,136 overlapped CNV regions reported in one or more public datasets, while 82 high-confidence regions represent previously unreported variants. High-transmission CNV regions tended to show clear, discrete inheritance patterns in trio families, providing biological evidence that these CNVs are inherited within the population. Together, these results demonstrate that integrating linked-read sequencing with population-scale transmission-based validation provides a robust framework for identifying high-confidence CNV regions.
This catalogue of validated CNV regions represents an important resource for downstream functional analyses and the incorporation of structural variation into genomic selection and breeding programs.